Abstract

In this paper, we investigate the use of selectional restriction - the constraints
a predicate imposes on its arguments - in a language model for speech recognition.
We use an un-tagged corpus, followed by a public domain tagger and a very simple
finite state machine to obtain verb-object pairs from unrestricted English text.
We then measure the impact the knowledge of the verb has on the prediction of the
direct object in terms of the perplexity of a cluster-based language model. The
results show that even though a clustered bigram is more useful than a verb-object
model, the combination of the two leads to an improvement over the clustered
bigram model.



*This is a revised version of the original technical report, prepared for the cmp-lg server. The 
work was supported by NSERC operating grant OGP0041910. 
1 Introduction



The constraint a predicate imposes on its arguments is called selectional restriction.
For example, in the sentence fragment "She eats x", the verb "eat" imposes a
constraint on the direct object "x": "x" should be something that is usually eaten.
This phenomenon has received considerable attention in the linguistic community
and it was recently explained in statistical terms in [Res93]. Because it restricts
subsequent nouns and because nouns account for a large part of the perplexity (see
[Ueb94]), it seems natural to try to use selectional restriction in a language model
for speech recognition. In this paper, we report on our work in progress on this topic.
We begin by presenting in section 2 the process we use to obtain training and testing
data from unrestricted English text. The data constitutes the input to a clustering
algorithm and a language model, both of which are described in section 3. In section 4
we present the results we have obtained so far, followed by conclusions in section 5.



2 Training and Testing Data 

In order to use selectional restriction in a language model for speech recognition, we
have to be able to identify the predicate and its argument in naturally occurring,
unrestricted English text in an efficient manner. Since the parsing of unrestricted
text is an as yet unsolved, complicated problem by itself, we do not attempt to use
a sophisticated parser. Instead, we use the un-tagged version of the Wall Street
Journal Corpus (distributed by ACL/DCI), Xerox's public domain tagger (described
in [CKPS92]) and a very simple deterministic finite-state automaton to identify verbs
and their direct objects. The resulting data is certainly very noisy, but, as opposed
to more accurate data obtained from a sophisticated parser, it would be feasible to
use this method in a speech recogniser. The finite-state automaton we use has only
three states and it is shown in Figure 1. The circles correspond to states and the
arcs to transitions. The input to the automaton consists of a sequence of words with
associated tags. The words and tags^1 are classified into three different events:

• V : the word is a verb (iff its tag starts with "v" , "b" or "h" ) 

• PP: the word is a preposition (iff its tag is "in") 

• NC: the word starts a new clause (iff its tag is ".",":", ";","!","?","cs" or begins 
with "w") 

All other words and tags do not lead to a change in the state of the automaton.
Intuitively, state 1 corresponds to the beginning of a new clause without having seen
its verb, state 2 to a clause after having seen its verb and state 3 to a prepositional
phrase.

^1 "v" corresponds to most forms of most verbs, "b" corresponds to forms of "to be",
"h" corresponds to forms of "to have", "cs" contains words like "although", "since"
and "w" contains words like "who", "when". The tagset we use is the one provided by
the tagger; for more information please refer to [CKPS92].

Given this automaton, we then consider each occurrence of a noun (e.g. tag
"nn") to be

• a direct object if we are currently in state 2; the corresponding verb is considered 
to be the one that caused the last transition into state 2; 

• part of a prepositional phrase if we are currently in state 3; the corresponding
preposition is considered to be the one that caused the last transition into state 3;

• unconstrained by a predicate or preposition (for example a subject) if we are in 
state 1. 

The automaton then outputs a sequence of verb-object pairs, which constitute our
training and testing data. Examples of the output of the automaton are shown in
Table 1. Entries that would not normally be considered verb-object pairs are marked
with "*". The data is very noisy because errors can be introduced both by the tagger
(which makes its decisions based on a small window of words) and by the overly
simplistic finite state automaton (for example, not all occurrences of "cs" and "w"
constitute a new clause). Given this data, the goal of our language model is to predict
the direct objects and we will measure the influence the knowledge of the preceding
verb has on this prediction in terms of perplexity.
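
The extraction procedure described above can be sketched in a few lines of Python.
This is only an illustration under stated assumptions: the transition structure (V
always leads to state 2, PP to state 3, NC to state 1) is inferred from the textual
description rather than copied from Figure 1, nouns are assumed to be exactly the
tags starting with "nn", and the names `classify` and `extract_pairs` are hypothetical.

```python
def classify(tag):
    """Map a tag to one of the events "V", "PP", "NC", or None."""
    if tag in {".", ":", ";", "!", "?", "cs"} or tag.startswith("w"):
        return "NC"  # the word starts a new clause
    if tag == "in":
        return "PP"  # the word is a preposition
    if tag.startswith(("v", "b", "h")):
        return "V"   # the word is a verb
    return None      # no change of state


def extract_pairs(tagged_words):
    """Return (verb, object) pairs from a sequence of (word, tag) tuples."""
    state, head = 1, None  # state 1: new clause, verb not yet seen
    pairs = []
    for word, tag in tagged_words:
        event = classify(tag)
        if event == "V":
            state, head = 2, word   # assumed: V leads to state 2 from any state
        elif event == "PP":
            state, head = 3, word   # assumed: PP leads to state 3 from any state
        elif event == "NC":
            state, head = 1, None   # assumed: NC leads back to state 1
        elif tag.startswith("nn") and state == 2:
            pairs.append((head, word))  # noun in state 2: direct object
    return pairs


sentence = [("She", "pps"), ("eats", "vbz"), ("apples", "nns"),
            ("in", "in"), ("summer", "nn"), (".", ".")]
print(extract_pairs(sentence))
```

In this toy sentence, "summer" is not emitted because the automaton is in state 3
(prepositional phrase) when it is read.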



Verb       Object
join       board
is         chairman      *
named      director
make       cigarette
make       filters
caused     percentage
enters     lungs
causing    symptoms
show       decades       *
show       researchers
Loews      Corp.         *
makes      cigarettes
stopped    crocidolite
bring      attention
This       old           *
having     properties
is         asbestos      *
studied    workers

Table 1: Examples of the noisy verb-object data

3 The Language Model 

We can formulate the task of the language model as the prediction of the value
of some variable $Y$ (e.g. the next word) based on some knowledge about the past
encoded by variables $X_1, \ldots, X_n$. In our case, $Y$ is the next direct object and
the only knowledge we have about the past is the identity of the previous verb, encoded
by the variable $X$. The most straightforward way would be to directly estimate the
conditional probability distribution $p(Y = y_l | X = x_k)$. However, because of sparse
training data, it is often difficult to estimate this distribution directly. Class based
models, which group elements $x_k \in \mathcal{X}$ into classes $g_x = G_1(x_k)$ and
elements $y_l \in \mathcal{Y}$ into classes $g_y = G_2(y_l)$, can be used to alleviate
this problem. The conditional probability distribution is then calculated as

$$p_G(y_l | x_k) = p(G_2(y_l) | G_1(x_k)) * p(y_l | G_2(y_l)), \qquad (1)$$

which generally requires less training data^2. In the following, let $(X[i], Y[i])$,
$1 \le i \le N$, denote the values of $X$ and $Y$ at the $i$-th data point and let
$(G_1[i], G_2[i])$ denote the corresponding classes. How can we obtain the classification
functions $G_1$ and $G_2$ automatically, given only a set of $N$ data points
$(X[i], Y[i])$? In the following (sections 3.1, 3.2, 3.3), we describe a method which is
almost identical to the one presented in [KN93]. The only difference is that in [KN93],
the elements of variables $X$ and $Y$ are identical (the data consists of bigrams), thus
requiring only one clustering function $G$^3. In our case, however, the variables $X$ and
$Y$ contain different elements (verbs and objects respectively), and we thus produce two
clustering functions $G_1$ and $G_2$.

^2 As an example, if $\mathcal{X}$ and $\mathcal{Y}$ have 10,000 elements each, and if we
use 200 classes for $G_1$ and for $G_2$, then the original model has to estimate
$10,000 * 10,000 = 1 * 10^8$ probabilities whereas the class based model only needs to
estimate $200 * 200 + 200 * 10,000 = 2.04 * 10^6$.

3.1 Maximum-Likelihood Criterion 

In order to automatically find classification functions $G_1$ and $G_2$, which - as a
shorthand - we will also denote as $G$, we first convert the classification problem into
an optimisation problem. Suppose the function $F(G)$ indicates how good the
classification $G$ (composed of $G_1$ and $G_2$) is. We can then reformulate the
classification problem as finding the classification $G$ that maximises $F$:

$$G = \arg\max_{G' \in \mathcal{G}} F(G'), \qquad (2)$$

where $\mathcal{G}$ contains the set of possible classifications which are at our disposal.

What is a suitable function $F$, also called optimisation criterion? Given a
classification function $G$, we can estimate the probabilities $p_G(y_l | x_k)$ of
equation 1 using the maximum likelihood estimator, i.e. relative frequencies:

$$\begin{aligned}
p_G(y_l | x_k) &= p(G_2(y_l) | G_1(x_k)) * p(y_l | G_2(y_l)) \qquad (3) \\
               &= \frac{N(g_x, g_y)}{N(g_x)} * \frac{N(g_y, y_l)}{N(g_y)}, \qquad (4)
\end{aligned}$$

where $g_x = G_1(x_k)$, $g_y = G_2(y_l)$ and $N(x)$ denotes the number of times $x$ occurs
in the data. Given these probability estimates $p_G(y_l | x_k)$, the likelihood $F_{ML}$ of
the training data, i.e. the probability of the training data being generated by our
probability estimates $p_G(y_l | x_k)$, measures how well the training data is represented
by the estimates and can be used as optimisation criterion ([Jel90]).
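
As an illustration of equations 3 and 4, the relative-frequency estimate can be
computed directly from counts. The following is a minimal sketch, assuming the data
is a list of (verb, object) pairs and that the classifications are given as
dictionaries; the function name `class_based_estimate` and the toy data are
hypothetical.

```python
from collections import Counter


def class_based_estimate(pairs, G1, G2):
    """Return p(y, x), the relative-frequency estimate of equations 3 and 4."""
    n_gx = Counter(G1[x] for x, _ in pairs)               # N(g_x)
    n_gy = Counter(G2[y] for _, y in pairs)               # N(g_y)
    n_gxgy = Counter((G1[x], G2[y]) for x, y in pairs)    # N(g_x, g_y)
    n_y = Counter(y for _, y in pairs)  # equals N(g_y, y) because G2 is a function

    def p(y, x):
        gx, gy = G1[x], G2[y]
        return (n_gxgy[(gx, gy)] / n_gx[gx]) * (n_y[y] / n_gy[gy])

    return p


pairs = [("eat", "apple"), ("eat", "bread"), ("drink", "water"), ("devour", "apple")]
G1 = {"eat": 0, "devour": 0, "drink": 1}
G2 = {"apple": 0, "bread": 0, "water": 1}
p = class_based_estimate(pairs, G1, G2)
print(p("bread", "devour"))  # the pair (devour, bread) never occurs in the data
```

The pair (devour, bread) is unseen, yet it receives a non-zero probability because
"devour" shares a class with "eat". This is the generalisation that the class based
model is meant to provide.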

In the following, we will derive an optimisation function $F_{ML}$ in terms of frequency
counts observed in the training data. The likelihood of the training data $F_{ML}$ is simply

$$\begin{aligned}
F_{ML} &= \prod_{i=1}^{N} p_G(Y[i] | X[i]) \qquad (5) \\
       &= \prod_{i=1}^{N} \frac{N(G_1[i], G_2[i])}{N(G_1[i])} * \frac{N(G_2[i], Y[i])}{N(G_2[i])}. \qquad (6)
\end{aligned}$$

Assuming that the classification is unique, i.e. that $G_1$ and $G_2$ are functions, we
have $N(G_2[i], Y[i]) = N(Y[i])$ (because $Y[i]$ always occurs with the same class $G_2[i]$).
Since we try to optimise $F_{ML}$ with respect to $G$, we can remove any term that does
not depend on $G$, because it will not influence our optimisation. It is thus equivalent
to optimise

$$\begin{aligned}
F'_{ML} &= \prod_{i=1}^{N} f(X[i], Y[i]) \qquad (7) \\
        &= \prod_{i=1}^{N} \frac{N(G_1[i], G_2[i])}{N(G_1[i])} * \frac{1}{N(G_2[i])}. \qquad (8)
\end{aligned}$$

^3 It is possible that using two clustering functions would be beneficial even if the two
variables have the same elements.

If, for two pairs $(X[i], Y[i])$ and $(X[j], Y[j])$, we have $G_1(X[i]) = G_1(X[j])$ and
$G_2(Y[i]) = G_2(Y[j])$, then we know that $f(X[i], Y[i]) = f(X[j], Y[j])$. We can thus
regroup identical terms to obtain

$$F'_{ML} = \prod_{g_x, g_y} \left[ \frac{N(g_x, g_y)}{N(g_x)} * \frac{1}{N(g_y)} \right]^{N(g_x, g_y)}, \qquad (9)$$

where the product is over all possible pairs $(g_x, g_y)$. Because $N(g_x)$ does not depend
on $g_y$ and $N(g_y)$ does not depend on $g_x$, we can simplify this again to

$$F'_{ML} = \prod_{g_x, g_y} N(g_x, g_y)^{N(g_x, g_y)} * \prod_{g_x} \left( \frac{1}{N(g_x)} \right)^{N(g_x)} * \prod_{g_y} \left( \frac{1}{N(g_y)} \right)^{N(g_y)}. \qquad (10)$$

Taking the logarithm, we obtain the equivalent optimisation criterion

$$\begin{aligned}
F''_{ML} = {} & \sum_{g_x, g_y} N(g_x, g_y) * \log(N(g_x, g_y)) - \sum_{g_x} N(g_x) * \log(N(g_x)) \qquad (11) \\
              & - \sum_{g_y} N(g_y) * \log(N(g_y)).
\end{aligned}$$

$F''_{ML}$ is the maximum likelihood optimisation criterion and we could use it to
find good classifications $G$. However, the problem with this maximum likelihood
criterion is that we first estimate the probabilities $p_G(y_l | x_k)$ on the training
data $T$ and then, given $p_G(y_l | x_k)$, we evaluate the classification $G$ on $T$. In
other words, both the classification $G$ and the estimator $p_G(y_l | x_k)$ are trained on
the same data. Thus, there will not be any unseen event, a fact that overestimates the
power for generalisation of the class based model. In order to avoid this, we will in
section 3.2 incorporate a cross-validation technique directly into the optimisation
criterion.
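
The logarithmic maximum likelihood criterion of equation 11 depends only on class
counts, so it can be evaluated for any candidate classification. The sketch below is
an illustration rather than the implementation used in the experiments; the function
name `f_ml` and the toy data are hypothetical.

```python
import math
from collections import Counter


def f_ml(pairs, G1, G2):
    """Logarithmic maximum likelihood criterion of equation 11."""
    n_gx = Counter(G1[x] for x, _ in pairs)
    n_gy = Counter(G2[y] for _, y in pairs)
    n_gxgy = Counter((G1[x], G2[y]) for x, y in pairs)
    joint = sum(n * math.log(n) for n in n_gxgy.values())
    marginal_x = sum(n * math.log(n) for n in n_gx.values())
    marginal_y = sum(n * math.log(n) for n in n_gy.values())
    return joint - marginal_x - marginal_y


pairs = [("eat", "apple"), ("eat", "bread"), ("drink", "water"), ("devour", "apple")]
G2 = {"apple": 0, "bread": 0, "water": 1}
sensible = {"eat": 0, "devour": 0, "drink": 1}   # groups the two "food" verbs
arbitrary = {"eat": 0, "devour": 1, "drink": 0}  # mixes "eat" with "drink"
print(f_ml(pairs, sensible, G2), f_ml(pairs, arbitrary, G2))
```

On this toy data, the classification that groups "eat" with "devour" obtains the
higher criterion value, as one would hope.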



3.2 Leaving-One-Out Criterion

The basic principle of cross-validation is to split the training data $T$ into a "retained"
part $T_R$ and a "held-out" part $T_H$. We can then use $T_R$ to estimate the probabilities
$p_G(y_l | x_k)$ for a given classification $G$, and $T_H$ to evaluate how well the
classification $G$ performs. The so-called leaving-one-out technique is a special case of
cross-validation ([PH73, pp.75]). It divides the data into $N - 1$ samples as "retained"
part and only one sample as "held-out" part. This is repeated $N$ times, so that each
sample is once in the "held-out" part. The advantage of this approach is that all samples
are used in the "retained" and in the "held-out" part, thus making very efficient use of
the existing data. In other words, our "held-out" part $T_H$ to evaluate a classification
$G$ is the entire set of data points; but when we calculate the probability of the $i$-th
data point, we assume that our probability distribution $p_G(y_l | x_k)$ was estimated on
all the data except point $i$.

Let $T_i$ denote the data without the pair $(X[i], Y[i])$ and $p_{G,T_i}(y_l | x_k)$ the
probability estimates based on a given classification $G$ and training corpus $T_i$. Given
a particular $T_i$, the probability of the "held-out" part is $p_{G,T_i}(y_l | x_k)$. The
probability of the complete corpus, where each pair is in turn considered the "held-out"
part, is the leaving-one-out likelihood $L_{LO}$

$$L_{LO} = \prod_{i=1}^{N} p_{G,T_i}(Y[i] | X[i]). \qquad (12)$$

In the following, we will derive an optimisation function $F_{LO}$ by specifying how
$p_{G,T_i}(Y[i] | X[i])$ is estimated from frequency counts. First we rewrite
$p_{G,T_i}(y_l | x_k)$ as usual (see equation 1):

$$\begin{aligned}
p_{G,T_i}(y_l | x_k) &= p_{G,T_i}(G_2(y_l) | G_1(x_k)) * p_{G,T_i}(y_l | G_2(y_l)) \qquad (13) \\
                     &= \frac{p_{G,T_i}(g_x, g_y)}{p_{G,T_i}(g_x)} * \frac{p_{G,T_i}(g_y, y_l)}{p_{G,T_i}(g_y)}, \qquad (14)
\end{aligned}$$

where $g_x = G_1(x_k)$ and $g_y = G_2(y_l)$. Now we will specify how we estimate each term
in equation 14.

As we saw before, $p_{G,T_i}(g_y, y_l) = p_{G,T_i}(y_l)$ (if the classification $G_2$ is a
function) and since $p_{T_i}(y_l)$ is actually independent of $G$, we can drop it out of the
maximization and thus need not specify an estimate for it.

As we will see later, we can guarantee that every class $g_x$ and $g_y$ has been seen at
least once in the "retained" part and we can thus use relative counts as estimates for
class uni-grams:

$$p_{G,T_i}(g_x) = \frac{N_{T_i}(g_x)}{N_{T_i}} \qquad (15)$$

$$p_{G,T_i}(g_y) = \frac{N_{T_i}(g_y)}{N_{T_i}}. \qquad (16)$$

In the case of the class bi-gram, we might have to predict unseen events^4. We
therefore use the absolute discounting method ([NE93]), where the counts are reduced
by a constant value $b < 1$ and where the gained probability mass is redistributed over
unseen events. Let $n_{0,T_i}$ be the number of unseen pairs $(g_x, g_y)$ and $n_{+,T_i}$ the
number of seen pairs $(g_x, g_y)$, leading to the following smoothed estimate

$$p_{G,T_i}(g_x, g_y) =
\begin{cases}
\dfrac{N_{T_i}(g_x, g_y) - b}{N_{T_i}} & \text{if } N_{T_i}(g_x, g_y) > 0 \\[2ex]
\dfrac{n_{+,T_i} * b}{n_{0,T_i} * N_{T_i}} & \text{if } N_{T_i}(g_x, g_y) = 0
\end{cases} \qquad (17)$$

^4 If $(X[i], Y[i])$ occurs only once in the complete corpus, then $p_{G,T_i}(Y[i] | X[i])$
will have to be calculated based on the corpus $T_i$, which does not contain any
occurrences of $(X[i], Y[i])$.
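Ideally, we would make $b$ depend on the classification, e.g. use
$b = \frac{n_1}{n_1 + 2 * n_2}$, where $n_1$ and $n_2$ depend on $G$. However, due to
computational reasons, we use, as suggested in [KN93], the empirically determined
constant value $b = 0.75$ during clustering. The probability distribution
$p_{G,T_i}(g_x, g_y)$ will always be evaluated on the "held-out" part and with
$g_{x,i} = G_1[i]$ and $g_{y,i} = G_2[i]$ we obtain

$$p_{G,T_i}(g_{x,i}, g_{y,i}) =
\begin{cases}
\dfrac{N_{T_i}(g_{x,i}, g_{y,i}) - b}{N_{T_i}} & \text{if } N_{T_i}(g_{x,i}, g_{y,i}) > 0 \\[2ex]
\dfrac{n_{+,T_i} * b}{n_{0,T_i} * N_{T_i}} & \text{if } N_{T_i}(g_{x,i}, g_{y,i}) = 0
\end{cases} \qquad (18)$$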
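
The absolute discounting estimate of equation 17 can be illustrated as follows. This
is a minimal sketch under the assumption that the discounted mass is shared equally
among the unseen class pairs; the function name `discounted_pair_prob` and the toy
counts are hypothetical.

```python
def discounted_pair_prob(count, n_total, n_seen, n_unseen, b=0.75):
    """Smoothed probability of a class pair with absolute discounting.

    count    -- number of times the class pair was observed
    n_total  -- total number of observed pairs
    n_seen   -- number of distinct class pairs with count > 0
    n_unseen -- number of distinct class pairs with count == 0
    """
    if count > 0:
        return (count - b) / n_total
    # Each seen pair gave up b / n_total; share that mass among unseen pairs.
    return n_seen * b / (n_unseen * n_total)


# Toy example: four possible class pairs, two of them observed (3 and 1 times).
counts = {(0, 0): 3, (1, 1): 1, (0, 1): 0, (1, 0): 0}
n_total = sum(counts.values())
n_seen = sum(1 for c in counts.values() if c > 0)
n_unseen = len(counts) - n_seen
probs = {pair: discounted_pair_prob(c, n_total, n_seen, n_unseen)
         for pair, c in counts.items()}
print(probs, sum(probs.values()))
```

The probabilities of all class pairs still sum to one, which is the property that
makes the discounted estimate a proper distribution.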



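In order to facilitate future regrouping of terms, we can now express the counts
$N_{T_i}$, $N_{T_i}(g_x)$ etc. in terms of the counts of the complete corpus $T$ as follows:

$$N_{T_i} = N_T - 1 \qquad (19)$$

$$N_{T_i}(g_x) = N_T(g_x) - 1 \qquad (20)$$

$$N_{T_i}(g_y) = N_T(g_y) - 1 \qquad (21)$$

$$N_{T_i}(g_{x,i}, g_{y,i}) = N_T(g_{x,i}, g_{y,i}) - 1 \qquad (22)$$

$$N_{T_i} = N_T - 1 \qquad (23)$$

$$n_{+,T_i} =
\begin{cases}
n_{+,T} & \text{if } N_T(g_{x,i}, g_{y,i}) > 1 \\
n_{+,T} - 1 & \text{if } N_T(g_{x,i}, g_{y,i}) = 1
\end{cases} \qquad (24)$$

$$n_{0,T_i} =
\begin{cases}
n_{0,T} & \text{if } N_T(g_{x,i}, g_{y,i}) > 1 \\
n_{0,T} + 1 & \text{if } N_T(g_{x,i}, g_{y,i}) = 1
\end{cases} \qquad (25)$$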

All we have left to do now is to substitute all the expressions back into equation 12.
After dropping $p_{G,T_i}(y_l)$ because it is independent of $G$ we get

$$\begin{aligned}
F_{LO} &= \prod_{i=1}^{N} \frac{p_{G,T_i}(g_{x,i}, g_{y,i})}{p_{G,T_i}(g_{x,i})} * \frac{1}{p_{G,T_i}(g_{y,i})} \qquad (26) \\
       &= \prod_{g_x, g_y} \left( p_{G,T_i}(g_x, g_y) \right)^{N(g_x, g_y)} * \prod_{g_x} \left( \frac{1}{p_{G,T_i}(g_x)} \right)^{N(g_x)} * \prod_{g_y} \left( \frac{1}{p_{G,T_i}(g_y)} \right)^{N(g_y)}. \qquad (27)
\end{aligned}$$

We can now substitute equations 16 and 18, using the counts of the whole corpus
of equations 19 to 25. After having dropped terms independent of $G$, we obtain

$$\begin{aligned}
F'_{LO} = {} & \prod_{g_x, g_y : N_T(g_x, g_y) > 1} \left( N_T(g_x, g_y) - 1 - b \right)^{N_T(g_x, g_y)} * \left( \frac{(n_{+,T} - 1) * b}{n_{0,T} + 1} \right)^{n_{1,T}} \qquad (28) \\
             & * \prod_{g_x} \left( \frac{1}{N_T(g_x) - 1} \right)^{N_T(g_x)} * \prod_{g_y} \left( \frac{1}{N_T(g_y) - 1} \right)^{N_T(g_y)},
\end{aligned}$$

where $n_{1,T}$ is the number of pairs $(g_x, g_y)$ seen exactly once in $T$ (i.e. the number
of pairs that will be unseen when used as "held-out" part). Taking the logarithm, we
obtain the final optimisation criterion $F''_{LO}$

$$\begin{aligned}
F''_{LO} = {} & \sum_{g_x, g_y : N_T(g_x, g_y) > 1} N_T(g_x, g_y) * \log(N_T(g_x, g_y) - 1 - b) \qquad (29) \\
              & + n_{1,T} * \log\left( \frac{b * (n_{+,T} - 1)}{n_{0,T} + 1} \right) \\
              & - \sum_{g_x} N_T(g_x) * \log(N_T(g_x) - 1) - \sum_{g_y} N_T(g_y) * \log(N_T(g_y) - 1).
\end{aligned}$$

    - start with an initial clustering function G
    - iterate until some convergence criterion is met
      {
        - for all x in X and y in Y
          {
            - for all gx and gy
              {
                - calculate the difference in F(G) when
                  x/y is moved from its current cluster to gx/gy
              }
            - move the x/y that results in the biggest improvement
              in the value of F(G)
          }
      }

Figure 2: The clustering algorithm

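
To make the leaving-one-out criterion of equation 29 concrete, it can be evaluated
from class counts as sketched below. This is an illustration under stated assumptions:
the number of unseen class pairs is taken to be the number of possible class pairs
(`n_classes_x * n_classes_y`) minus the number of seen ones, every class is assumed to
occur at least twice (otherwise a logarithm of zero would arise), and the function
name `f_lo` is hypothetical.

```python
import math
from collections import Counter


def f_lo(pairs, G1, G2, n_classes_x, n_classes_y, b=0.75):
    """Logarithmic leaving-one-out criterion of equation 29."""
    n_gx = Counter(G1[x] for x, _ in pairs)
    n_gy = Counter(G2[y] for _, y in pairs)
    n_gxgy = Counter((G1[x], G2[y]) for x, y in pairs)

    n_plus = len(n_gxgy)                               # seen class pairs
    n_zero = n_classes_x * n_classes_y - n_plus        # unseen class pairs (assumption)
    n_one = sum(1 for n in n_gxgy.values() if n == 1)  # pairs seen exactly once

    # Pairs seen more than once remain seen after leaving one occurrence out.
    total = sum(n * math.log(n - 1 - b) for n in n_gxgy.values() if n > 1)
    # Pairs seen exactly once become unseen when they are held out.
    if n_one > 0:
        total += n_one * math.log(b * (n_plus - 1) / (n_zero + 1))
    # Class uni-gram terms; requires every class count to be at least 2.
    total -= sum(n * math.log(n - 1) for n in n_gx.values())
    total -= sum(n * math.log(n - 1) for n in n_gy.values())
    return total


pairs = [("eat", "apple"), ("eat", "bread"), ("devour", "apple"),
         ("drink", "water"), ("sip", "water"), ("drink", "apple")]
G1 = {"eat": 0, "devour": 0, "drink": 1, "sip": 1}
G2 = {"apple": 0, "bread": 0, "water": 1}
print(f_lo(pairs, G1, G2, n_classes_x=2, n_classes_y=2))
```

Unlike the maximum likelihood criterion, this value penalises class pairs that occur
only once, because they are treated as unseen events when they are held out.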

3.3 Clustering Algorithm 

Given the $F''_{LO}$ maximization criterion, we use the algorithm in Figure 2 to find a
good clustering function $G$. The algorithm tries to make local changes by moving
words between classes, but only if it improves the value of the optimisation function.
The algorithm will converge because the optimisation criterion is made up of
logarithms of probabilities and thus has an upper limit and because the value of the
optimisation criterion increases in each iteration. However, the solution found by this
greedy algorithm is only locally optimal and it depends on the starting conditions.
Furthermore, since the clustering of one word affects the future clustering of other
words, the order in which words are moved is important. As suggested in [KN93], we
sort the words by the number of times they occur such that the most frequent words,
about which we know the most, are clustered first. Moreover, we do not consider
infrequent words (e.g. words with occurrence counts smaller than 5) for clustering,
because the information they provide is not very reliable. Thus, if we start out with
an initial clustering in which no cluster occurs only once, and if we never move words
that occur only once, then we will never have a cluster which occurs only once. Thus,
the assumption we made earlier, when we decided to estimate cluster uni-grams by
frequency counts, can be guaranteed.

We will now determine the complexity of the algorithm. Let $M_X$ and $M_Y$ be the
maximal number of clusters for $X$ and $Y$, let $|\mathcal{X}|$ and $|\mathcal{Y}|$ be the
number of possible values for $X$ and $Y$, let $M = \max(M_X, M_Y)$,
$W = \max(|\mathcal{X}|, |\mathcal{Y}|)$ and let $I$ be the number of iterations. When we
move $x$ from $g_x$ to $g'_x$ in the inner loop (the situation is symmetrical for $y$), we
need to change the counts $N(g_x, g_y)$ and $N(g'_x, g_y)$ for all $g_y$. The amount by which
we need to change the counts is equal to the number of times $x$ occurred with cluster
$g_y$. Since this amount is independent of $g'_x$, we need to calculate it only once for
each $x$. The amount can then be looked up in constant time within the loop, thus making
the inner loop of order $M$. The inner loop is executed once for every cluster $x$ can be
moved to, thus giving a complexity of the order of $M^2$. For each $x$, we needed to
calculate the number of times $x$ occurred with all clusters $g_y$. For that we have to
sum up all the bigram counts $N(x, y) : G_2(y) = g_y$, which is on the order of $W$, thus
giving a complexity of the order of $W + M^2$. The two outer loops are executed $I$ and
$W$ times, thus giving a total complexity of the order of $I * W * (W + M^2)$.
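
The greedy exchange procedure can be sketched as follows. This is a deliberately
simplified illustration, not the implementation analysed above: it recomputes the
criterion from scratch for every candidate move instead of updating the counts
incrementally (so it does not achieve the stated complexity), it does not skip
infrequent words, and it assumes that all `mx * my` class pairs are possible. The
function names `criterion` and `exchange` are hypothetical.

```python
import math
from collections import Counter


def criterion(pairs, G1, G2, mx, my, b=0.75):
    """Leaving-one-out criterion (equation 29), recomputed from scratch."""
    n_gx = Counter(G1[x] for x, _ in pairs)
    n_gy = Counter(G2[y] for _, y in pairs)
    n_xy = Counter((G1[x], G2[y]) for x, y in pairs)
    # Equation 29 takes log(N - 1), so every class must occur at least twice.
    if min(n_gx.values()) < 2 or min(n_gy.values()) < 2:
        return -math.inf
    n_plus = len(n_xy)
    n_one = sum(1 for n in n_xy.values() if n == 1)
    n_zero = mx * my - n_plus  # assumption: all mx * my class pairs are possible
    total = sum(n * math.log(n - 1 - b) for n in n_xy.values() if n > 1)
    if n_one:
        total += n_one * math.log(b * (n_plus - 1) / (n_zero + 1))
    total -= sum(n * math.log(n - 1) for n in n_gx.values())
    total -= sum(n * math.log(n - 1) for n in n_gy.values())
    return total


def exchange(pairs, G1, G2, mx, my, max_iter=10):
    """Greedy exchange clustering in the spirit of Figure 2."""
    G1, G2 = dict(G1), dict(G2)  # work on copies, leave the inputs untouched
    best = criterion(pairs, G1, G2, mx, my)
    for _ in range(max_iter):
        improved = False
        for position, mapping, n_classes in ((0, G1, mx), (1, G2, my)):
            freq = Counter(pair[position] for pair in pairs)
            # Most frequent words first, as suggested in the text.
            for word in sorted(mapping, key=lambda w: -freq[w]):
                home = mapping[word]
                best_target = home
                for target in range(n_classes):
                    if target == home:
                        continue
                    mapping[word] = target  # tentatively move the word
                    score = criterion(pairs, G1, G2, mx, my)
                    if score > best:
                        best, best_target = score, target
                mapping[word] = best_target  # keep the best move, or stay put
                improved = improved or best_target != home
        if not improved:
            break  # no single move improves the criterion any more
    return G1, G2, best
```

Because a word is only moved when the criterion strictly increases, the returned
value is never lower than that of the initial clustering. As noted in the text, the
result is only locally optimal and depends on the initial clustering.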

4 Results 

In the experiments performed so far, we work with 200,000 verb-object pairs extracted
from the 1989 section of the Wall Street Journal corpus. The data contains about
19,000 different direct object tokens, about 10,000 different verb tokens and about
140,000 different token pairs. We use | of the data as training and the rest as testing
data. For computational reasons, we have so far restricted the number of possible
clusters to 50, but we hope to be able to increase that in the future. The perplexity
on the testing text using the clustering algorithm on the verb-object pairs is shown in
Table 2. For comparison, the table also contains the perplexity of a normal uni-gram
model (i.e. no predictor variable $X$) and the performance of the clustering algorithm
on the usual bi-gram data (i.e. the word immediately preceding the direct object as
predictor variable $X$). We can see that the verb contains considerable information
about the direct object and leads to a reduction in perplexity of about 18%. However,
the immediately preceding word leads to an even bigger reduction of about 37%. We
also tried a linear interpolation of the two clustered models

$$p_{interpol}(y_l) = \lambda * p_{verb-object}(y_l) + (1 - \lambda) * p_{bigram}(y_l). \qquad (30)$$

On a small set of unseen data, we determined the best value (out of 50 possible values
in ]0,1[) of the interpolation parameter $\lambda$. As shown in Table 2, the interpolated
model leads to an overall perplexity reduction of 43% compared to the uni-gram,
which corresponds to a reduction of about 10% over the normal bi-gram perplexity.
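
The interpolation of equation 30 and the selection of the interpolation parameter can
be sketched as follows. This is only an illustration: the 50 candidate values are
assumed to be evenly spaced inside the open interval ]0,1[, which the text does not
specify, and the function names and toy probabilities are hypothetical.

```python
import math


def interpolate(p_verb_object, p_bigram, lam):
    """Linear interpolation of two model probabilities (equation 30)."""
    return lam * p_verb_object + (1.0 - lam) * p_bigram


def perplexity(probs):
    """Perplexity of a sequence of predicted probabilities."""
    return math.exp(-sum(math.log(p) for p in probs) / len(probs))


def best_lambda(held_out, steps=50):
    """Pick the candidate value with the lowest perplexity on held-out data.

    held_out -- list of (p_verb_object, p_bigram) tuples, one per direct object
    """
    # Assumption: candidates are evenly spaced strictly between 0 and 1.
    candidates = [(k + 1) / (steps + 1) for k in range(steps)]
    return min(
        candidates,
        key=lambda lam: perplexity(
            [interpolate(pv, pb, lam) for pv, pb in held_out]
        ),
    )


# Toy held-out data: per-object probabilities under the two models.
held_out = [(0.02, 0.001), (0.0005, 0.03), (0.01, 0.01)]
lam = best_lambda(held_out)
print(lam, perplexity([interpolate(pv, pb, lam) for pv, pb in held_out]))
```

When each model is better on some of the held-out objects, as in this toy data, the
selected value lies strictly between the extremes, which mirrors the observation that
the two predictors complement each other.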

Model                    Perplexity    % Reduction
uni-gram                 3200          not appl.
clustered verb-object    2640          18%
clustered bigram         2010          37%
combined model           1820          43%

Table 2: Perplexity results

5 Conclusions

From a purely linguistic perspective, it would be slightly surprising to find out that
the word immediately preceding a direct object can be used better to predict it than
the preceding verb. However, this conclusion cannot be drawn from our results
because of the noisy nature of the data. In other words, the data contains pairs like
(is, chairman), which would usually not be considered as a verb-direct object pair.
It is possible that more accurate data (i.e. fewer, but only correct pairs) would
lead to a different result. But the problem with fewer pairs would of course be that
the model can be used in fewer cases, thus reducing the usefulness to a language
model that would predict the entire text (rather than just the direct objects). The
results thus support the common language modeling practice, in that bi-gram events
(by themselves) seem to be more useful than this linguistically derived predictor (by
itself). Nevertheless, the interpolation results also show that this linguistically derived
predictor is useful as a complement to a standard class based bigram model. In the
future, we hope to consolidate these early findings by more experiments involving a
higher number of clusters and a larger data set.